Introduction

The objective of this second assignment is to explore the differentially expressed genes from the cleaned and normalized data in assignmnent one. Then rank the thresholded over-representation analysis to highlight the top terms / dominant themes in the top set of genes. Lastly, compare my result with the original literature and find some other supports as well for my result if possible. I make some changes to my assignment one and stored the file as “ammended_A1.Rmd” and imported it here. I will demonstrate briefly what I have changed and I general workflow in A1.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

if (!requireNamespace("limma", quietly = TRUE))
    BiocManager::install("limma")

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
    BiocManager::install("ComplexHeatmap")

if (!requireNamespace("edgeR", quietly = TRUE))
    BiocManager::install("edgeR")

if (!requireNamespace("EnsDb.Hsapiens.v75", quietly = TRUE))
    BiocManager::install("EnsDb.Hsapiens.v75")

if (!requireNamespace("knitr", quietly = TRUE))
    BiocManager::install("knitr")
if (!requireNamespace("gprofiler2", quietly = TRUE))
    BiocManager::install("gprofiler2")
library(limma)
library(edgeR)
library(EnsDb.Hsapiens.v75)
library(knitr)
library(ComplexHeatmap)
library(circlize)
library(gprofiler2)

A1

My data is from GSE84054

What I have changed in A1: I decided to leave the genes that have duplicated names because Ruth sugguested that they are a very small number compare to my total number of genes, ~1%. So I deleted many steps in A1, and made the cleaning step clearer by showing boxplots, density plots and MDS plots in each cleaning step.

  1. Boxplot - prefiltered

  2. Density Plot - prefiltered

  3. MDS plot - prefiltered

  1. Boxplot - filtered

  2. Density plot - filtered

  3. MDS plot - filtered

A2

Load data

Loaded my normalized data from A1.

normalized_count_data <- read.table(file="GSE84054_normalized_count.txt")
kable(normalized_count_data[1:5, 1:5], type="html")
GENEID SYMBOL RHB037 RHB038 RHB041
ENSG00000000003 ENSG00000000003 TSPAN6 79.809196 63.375225 37.719858
ENSG00000000419 ENSG00000000419 DPM1 36.461054 38.451386 49.614654
ENSG00000000457 ENSG00000000457 SCYL3 12.060195 12.077110 6.329049
ENSG00000000460 ENSG00000000460 C1orf112 6.762435 4.972927 3.848316
ENSG00000000938 ENSG00000000938 FGR 1.402348 3.966502 1.940060
## Explore: Normal ized
1. boxplot - norma lized
# After normalization
data2plot_after <- log2(normalized_count_data[,3:ncol(normalized_count_data)])
{boxplot(data2plot_after, xlab = "Samples", ylab = "log2 CPM", 
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Filtered and normalized RNASeq Samples")
abline(h = median(apply(data2plot_after, 2, median)), col = "green", lwd = 0.6, lty = "dashed")}

  1. density plot - normalized
counts_density <- apply(log2(normalized_count_data[,3:ncol(normalized_count_data)]), 2, density)
  #calculate the limits across all the samples
    xlim <- 0; ylim <- 0
    for (i in 1:length(counts_density)) {
      xlim <- range(c(xlim, counts_density[[i]]$x)); 
      ylim <- range(c(ylim, counts_density[[i]]$y))
    }
    cols <- rainbow(length(counts_density))
    ltys <- rep(1, length(counts_density))
    #plot the first density plot to initialize the plot
    plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n", 
         ylab="Smoothing density of log2-CPM",cex.lab = 0.85, 
         main = "Filtered and normalized RNASeq Samples distribution")
    #plot each line
    for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
    #create legend
    legend("topright", colnames(data2plot), col=cols, lty=ltys, cex=0.75, 
           border ="blue",  text.col = "green4", merge = TRUE, bg = "gray90")

PART #1: differential expression

STEP 1: Choice of factors in my model

  • I created the MDS by both using “cell_type” and “cell_type” and “patient”. The comparison between the two models is fairly clear: we should only depende on the factor “cell_type”, since there seems to be no correlation with each patient.
heatmap_matrix <- normalized_count_data[,3:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$GENEID
colnames(heatmap_matrix) <- rownames(samples_filtered)

# MDS plot by "cell_type" in samples
plotMDS(heatmap_matrix, labels=rownames(samples_filtered), 
        col = c("darkgreen","blue")[factor(samples_filtered$cell_type)],
        main = "MDS plot depending on cell type")

pat_colors <- rainbow(12)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
# MDS plot by "cell_type" + "patients"in samples
plotMDS(heatmap_matrix, col = pat_colors,
        main = "MDS plot depending on both cell type and patients")

STEP 2: Define my model design

Based on the two models in part1, I decide to base my model only on “cell_type”

model_design <- model.matrix(~ samples$cell_type)
kable(model_design, type="html")
(Intercept) samples$cell_typeSphere
1 0
1 0
1 0
1 0
1 0
1 0
1 0
1 0
1 0
1 0
1 0
1 0
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1

STEP 3: Calculate p-value

There are 6988 genes that pass the p-value = 0.05 which is chosen based on the paper.

expressionMatrix <- as.matrix(normalized_count_data[,3:ncol(normalized_count_data)])
rownames(expressionMatrix) <- normalized_count_data$GENEID
colnames(expressionMatrix) <- colnames(normalized_count_data)[3:ncol(normalized_count_data)]
minimalSet <- ExpressionSet(assayData=expressionMatrix)

# fit
fit <- lmFit(minimalSet, model_design)

# Use Bayes
fit2 <- eBayes(fit,trend=TRUE)

# Correction: BH
topfit <- topTable(fit2, coef=ncol(model_design),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))

#merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,1:2], topfit, by.y = 0, by.x = 1, all.y=TRUE)

#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
kable(output_hits[1:5,],type="html")

# number of genes that pass threshold p-value = 0.05
length(which(output_hits$P.Value < 0.05)) # 6988

# number of genes that pass correction
length(which(output_hits$adj.P.Val < 0.05)) # 4062

STEP 4: Set up EdgeR object

d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
d <- estimateDisp(d, model_design_pat)

STEP 5: Test whether my data is suitable for edgeR - MeanVar plot

I have shown that my data is suitable for using edgeR for further analysis. The data follows the binomial distribution.

plotMeanVar(d, show.raw.vars = TRUE,                
            show.tagwise.vars=TRUE,                 
            show.ave.raw.vars = TRUE,                                                         
            NBline=TRUE,
            show.binned.common.disp.vars = TRUE,
            main = "Binomial distribution of my data")

STEP 6: Estimate dispersion - BCV plot

The individual dots represent each gene and the blue line is the overall trend line.

plotBCV(d,col.tagwise = "black",col.common = "red", 
        main = "BCV plot of RNA-seq data")

STEP 7: Genes pass threshold and FDR correction

I used Quasi-likelihood models to fit my data and used QLFTest to test for differential expression. The Quasi-likelihood compares two conditions (primary tumour and tumoursphere) and shows the up and down-regulated genes. I have showed the result below that are sorted by p-value. I also inspected the number of genes that satisty my threshold and correction. I choose to use FDR correction based on the paper as well(Goh et al. 2017) . There are 7467 genes pass the p-value = 0.05, and 5033 genes that pass the FDR correction.

fit <- glmQLFit(d, model_design)
qlf.sphere_vs_tumour <- glmQLFTest(fit, coef='samples$cell_typeSphere')
kable(topTags(qlf.sphere_vs_tumour), type="html")

# Get all the results
qlf_output_hits <- topTags(qlf.sphere_vs_tumour, 
                           sort.by = "PValue", 
                           n = nrow(normalized_count_data))

# Number of genes that pass the threshold p-value = 0.05
length(which(qlf_output_hits$table$PValue < 0.05)) # 7467

# Number of genes that pass correction
length(which(qlf_output_hits$table$FDR < 0.05)) # 5033

STEP 8: Up and down-regulated genes

I determined the number of up-regulated genes by selecting every gene that does not pass my p-value: 0.05, and also have a positive log fold change. Down-regulated genes are selected in the same way with a negative log fold change. Stored these data for later enrichment analysis.

# How many genes are up regulated?
length(which(qlf_output_hits$table$PValue < 0.05 
             & qlf_output_hits$table$logFC > 0)) # 1897

# How many genes are down regulated?
length(which(qlf_output_hits$table$PValue < 0.05  
             & qlf_output_hits$table$logFC < 0)) # 5570

# Get those up and down-regulated genes
qlf_output_hits_withgn <- merge(expr_filtered[,1:2],qlf_output_hits, by.x=1, by.y = 0)

upregulated_genes <- qlf_output_hits_withgn$GENEID[which(qlf_output_hits$table$PValue < 0.05 
                                                         & qlf_output_hits$table$logFC > 0)]

downregulated_genes <-qlf_output_hits_withgn$GENEID[which(qlf_output_hits$table$PValue < 0.05 
                                                           & qlf_output_hits$table$logFC < 0)]


# store data
unreg_genes_copy <- data.frame(upregulated_genes)
downreg_genes_copy <- data.frame(downregulated_genes)
names(unreg_genes_copy) <- names(downreg_genes_copy)
all_de <- rbind(unreg_genes_copy, downreg_genes_copy)
colnames(all_de) <- "all_de"
write.table(x=all_de,
            file="all_expr_de_genes.txt",sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

write.table(x=upregulated_genes,
            file="expr_upregulated_genes.txt",sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

write.table(x=downregulated_genes,
            file="expr_downregulated_genes.txt",sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

STEP 9: Show up and down-regulated genes

I have shown the up and down-regulated genes in a volcano plot by coloring them in red and blue, the code is from (Annick Moisan, n.d.)

volcanoData <- cbind(qlf_output_hits$table$logFC, -log10(qlf_output_hits$table$FDR))
colnames(volcanoData) <- c("logFC", "Pval")

up <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC > 0
point.col <- ifelse(up, "red", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
     main = "Up-regulated genes in RNA-seq data")

down <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC < 0
point.col <- ifelse(down, "blue", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
      main = "Down-regulated genes in RNA-seq data")

STEP 10: Test Differential expression - heatmap

To test the differential expression, I used the heatmap and it has shown a clear distinction between up and down regulated genes. There is a clear difference between the primary tumour samples and tumoursphere samples.(They are reversed.) The clustering is very obvious to note that differential expression exists. s

top_hits <- rownames(qlf_output_hits$table)[output_hits$P.Value<0.05] 
heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),]))) 
heatmap_matrix_tophits <- heatmap_matrix_tophits[, c(grep(colnames(heatmap_matrix_tophits),pattern = "_P"), 
                                                    grep(colnames(heatmap_matrix_tophits),pattern = "_S"))]

if(min(heatmap_matrix_tophits) == 0){
    heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                             c( "white", "red"))
    } else {
    heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
    }

current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = FALSE,
                               show_row_dend = TRUE,
                               show_column_dend = FALSE,
                               col=heatmap_col,
                               show_column_names = TRUE, 
                               show_row_names = FALSE,
                               show_heatmap_legend = TRUE,)
current_heatmap

PART 2: Thresholded over-representation analysis

Introduction to PART 2:

  1. Which method did you choose and why?
  • I chose to use g:profiler because it shows me the top term names in KEGG, WP, GO and REAC, which is helpful for me when deciding what type of disease it is most likely to be.
  1. What annotation data did you use and why? What version of the annotation are you using?
  • GO biological process: releases/2019-07-01
  • KEGG: KEGG FTP Release 2019-09-30
  • Reactome: ensembl classes: 2019-10-2
  • WikiPathways: 20190910
  1. How many genesets were returned with what thresholds?
  • The threshold for all the queries: 0.05
  • 821 gene sets are returned for all the differentially expressed genes.
  • 60 gene sets are returned for up-regulated genes.
  • 1333 gene sets are returned for down-regulated genes.
  1. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?
  • The up-regulated gene sets are mostly cellular processes and metabolic processes.
  • The top terms for down-regulated gene sets are mostly signaling pathways and metabolic processes/
  • When running with both up and down-regulated genes, I found that most of them are dominated by the metabolic and cellular processes (up regulated top terms). Therefore, it convinces me the importance of the up-regulated genes in the cancer. And the result aligns with the paper that states the cancer is of subtype “basal”.
  • Below are my results for each up-regulated, down-regulated, and differentially expressed (both).

Up regulated genes

There were no REAC found. I used the gprofiler2’s function to query data and also attached the screenshots that I took on their website since the package does not show the number of gene sets each has found.

Down regulated genes

The some analysis is apply to down regulated

All differentially expressed genes

Interpretation questions

  1. Do the over-representation results support conclusions or mechanism discussed in the original paper?
upregulated_genes_sym <- qlf_output_hits_withgn$SYMBOL[which(qlf_output_hits$table$PValue < 0.05 
                                                         & qlf_output_hits$table$logFC > 1)]

length(upregulated_genes_sym) # 1312


upregulated_genes_sym[grep(pattern="ALDH",upregulated_genes_sym)]
# ALDH2, ALDH8A1, ALDH1L2
  1. Evidence that support and how they support your results.

References

Annick Moisan, Nathalie Villa-Vialaneix, Ignacio Gonzales. n.d. “Practical Statistical Analysis of Rna-Seq Data - edgeR - Tomato Data.”

Goh, Jian Yuan, Min Feng, Wenyu Wang, Gokce Oguz, Siti Maryam JM Yatim, Puay Leng Lee, Yi Bao, et al. 2017. “Chromosome 1q21. 3 Amplification Is a Trackable Biomarker and Actionable Target for Breast Cancer Recurrence.” Nature Medicine 23 (11). Nature Publishing Group: 1319.

Vasiliou, Vasilis, and Daniel W Nebert. 2005. “Analysis and Update of the Human Aldehyde Dehydrogenase (Aldh) Gene Family.” Human Genomics 2 (2). BioMed Central: 138.

Vassalli, Giuseppe. 2019. “Aldehyde Dehydrogenases: Not Just Markers, but Functional Regulators of Stem Cells.” Stem Cells International 2019. Hindawi.